(54) Title: METHOD AND APPARATUS FOR ORDERING ELECTRONIC DATA

(57) Abstract: The present invention relates to the field of management of data in a computer system. The invention proposes a new 
way of automatically ordering data and arranging them in a data structure in a computer. The invention employs the distance as a 
measure of similarity between data sets. Data sets are assigned to a structure of clusters depending on whether they have a distance 
above or below a limiting value that is correlated with a peak in the density of distance values.
Claims

1. Method of automatically ordering a plurality of sets of electronic data by means of a data
processing unit, comprising the following steps to be performed by said data processing
unit:

at least for a selected group of data sets, determining the distance D between any
two data sets, said distance being defined as a function of a pair of two data sets, rendering
a numerical value, said function having a first value D0 defined for the case of a pair
of identical data sets, the difference of the distance D of any pair to said value D0 being
defined to be either greater than or equal to zero for all pairs, D-D0 ≥ 0, or less than or equal
to zero for all pairs, D-D0 ≤ 0,

determining the density of distance values over the range of determined distance
values,

determining one or more limiting values, at least some of the limiting values
defining an upper boundary of a peak in said density of distance values, respectively, if
said difference is defined to be D-D0 ≥ 0 for all pairs, and at least some of the limiting
values defining a lower boundary of a peak, respectively, if said difference is defined to
be D-D0 ≤ 0, said limiting values forming an increasing series in case of a plurality of
limiting values,

creating correlation data correlating each data set to a cluster in a hierarchy of
clusters, the number of cluster levels in said hierarchy corresponding to the
number of limiting values, wherein,
if said difference is defined to be D-D0 ≥ 0 for all pairs,

the data sets contained in each first level cluster in said hierarchy are related to
one another in that for each data set the minimum pairwise distance to other data
sets in said cluster is less than the lowest limiting value,

each higher order cluster in said hierarchy comprises data sets of a group of one
or more clusters of lower levels, wherein, if said group comprises more than one
cluster, each cluster in this group forms a pair with another cluster in this group
for which pair there is at least one data set of one cluster of said pair having a
distance from a data set of the other cluster of said pair, which is less than that
limiting value that is the next higher one in said increasing series of limiting values
to that limiting value defining clusters at the next lower level,
and, if said difference is defined to be D-D0 ≤ 0 for all pairs,

the data sets contained in each first level cluster in said hierarchy are related to
one another in that for each data set the maximum pairwise distance to other data 
sets in said cluster is greater than the highest limiting value, 
each higher order cluster in said hierarchy comprises data sets of a group of one 
or more clusters of lower levels, wherein, if said group comprises more than one 
cluster, each cluster in this group forms a pair with another cluster in this group 
for which pair there is at least one data set of one cluster of said pair having a 
distance from a data set of the other cluster of said pair, which is greater than that 
limiting value that is the next lower one in said increasing series of limiting val- 
ues to that limiting value defining clusters at the next lower level. 
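
By way of illustration only, the creation of correlation data for the case D-D0 ≥ 0 can be sketched as follows. The sketch assumes a precomputed symmetric distance matrix and an increasing list of limiting values, and it links data sets by merging every pair whose distance lies below the current limiting value (a single-linkage reading of the claim, which is an assumption). The name `cluster_levels` and its parameters are hypothetical and do not appear in the source.

```python
from itertools import combinations


def cluster_levels(dist, limits):
    """Return one partition of data-set indices per limiting value.

    dist   -- symmetric matrix (list of lists) of pairwise distances, D >= D0
    limits -- limiting values in increasing order
    """
    n = len(dist)
    parent = list(range(n))  # union-find forest over data-set indices

    def find(i):
        # Follow parent links to the representative, shortening the path.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    levels = []
    for limit in limits:
        # Merge every pair whose distance is below the current limiting value.
        # The forest is kept between levels, so each higher-level cluster is
        # a union of clusters of the level below.
        for i, j in combinations(range(n), 2):
            if dist[i][j] < limit:
                parent[find(i)] = find(j)
        groups = {}
        for i in range(n):
            groups.setdefault(find(i), []).append(i)
        levels.append(sorted(groups.values()))
    return levels


example = [
    [0, 1, 5, 9],
    [1, 0, 4, 9],
    [5, 4, 0, 9],
    [9, 9, 9, 0],
]
hierarchy = cluster_levels(example, [2, 6])
```

In this example the first level groups data sets 0 and 1, the second level adds data set 2, and data set 3 remains alone at both levels, so the number of levels equals the number of limiting values.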

2. Method according to claim 1, characterised in that it comprises the step of creating data cor- 
relating each data set to a cluster in a hierarchy of clusters, the number of cluster levels in 
said hierarchy corresponding to the number of limiting values, 
wherein, if said difference is defined to be D-D0 ≥ 0 for all pairs,

each first level cluster in said hierarchy comprises at least one data set to which 
all other data sets of said cluster have a distance less than the lowest limiting
value, 

each second level cluster in said hierarchy comprises at least one data set to 
which all other data sets of said cluster have a distance less than the second low- 
est limiting value, 

each higher order cluster in said hierarchy comprises at least one data set to 
which all other data sets of said cluster have a distance which is less than that 
limiting value that is the next higher one in said increasing series of limiting values
to that limiting value defining clusters at the next lower level,
and, if said difference is defined to be D-D0 ≤ 0 for all pairs,

each first level cluster in said hierarchy comprises at least one data set to which 
all other data sets of said cluster have a distance greater than the highest limiting 
value, 
each second level cluster in said hierarchy comprises at least one data set to which
all other data sets of said cluster have a distance greater than the second highest 
limiting value, 

each higher order cluster in said hierarchy comprises at least one data set to which 
all other data sets of said cluster have a distance which is greater than that limiting 
value that is the next lower one in said increasing series of limiting values to that 
limiting value defining clusters at the next lower level.

3. Method according to claim 1 or 2, characterized in that said data correlating the data sets 
to clusters comprise: 

data correlating each data set to one or more first level clusters, 

data correlating each cluster at a level less than the highest level to a cluster or a
plurality of clusters at a higher level.

4. Method according to one of claims 1 to 3, characterized by the step of controlling a dis- 
play device on the basis of said correlation data to create a graphic symbolic display of 
clusters at one or more levels. 

5. Method according to one of claims 1 to 4, characterized by the step of creating a directory
structure on the basis of said correlation data, each cluster corresponding to a directory
and each cluster level to a directory level.
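
As a non-binding illustration, a directory path per cluster can be derived from correlation data that map each cluster to its parent. The sketch assumes the correlation data are available as a dictionary from cluster name to parent name (with `None` for a highest-level cluster); the function name `directory_paths`, the root name `clusters`, and the cluster names are hypothetical. It only computes path strings and does not touch the file system.

```python
from pathlib import PurePosixPath


def directory_paths(parent_of, root="clusters"):
    """Map each cluster name to a directory path reflecting the hierarchy."""

    def path(name):
        parent = parent_of.get(name)
        # A cluster without a parent sits directly below the root directory.
        base = PurePosixPath(root) if parent is None else path(parent)
        return base / name

    return {name: str(path(name)) for name in parent_of}


paths = directory_paths({"top": None, "mid": "top", "leaf": "mid"})
```

Each cluster level then corresponds to one directory level below the assumed root.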

6. Method according to one of claims 1 to 5, characterized by the step of creating a database 
from said data sets and said correlation data, the data model of said data base being de- 
fined by said hierarchy of clusters. 

7. Method according to claim 6, characterized in that the database is a relational data base, 
wherein the keys are defined by cluster names and the values are defined by the name of 
the parent cluster.
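
For illustration, a relational table keyed by cluster name with the parent cluster name as value might look as follows. The sketch uses Python's standard `sqlite3` module with an in-memory database; the table name `cluster`, the column names, and the helper functions are assumptions made for this example and are not taken from the source.

```python
import sqlite3


def store_hierarchy(parent_of):
    """Create an in-memory table: key = cluster name, value = parent name."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE cluster (name TEXT PRIMARY KEY, parent TEXT)")
    conn.executemany(
        "INSERT INTO cluster (name, parent) VALUES (?, ?)", parent_of.items()
    )
    return conn


def path_to_root(conn, name):
    """Follow parent references from a cluster up to a highest-level cluster."""
    path = []
    while name is not None:
        path.append(name)
        row = conn.execute(
            "SELECT parent FROM cluster WHERE name = ?", (name,)
        ).fetchone()
        name = row[0] if row else None
    return path


conn = store_hierarchy({"L2-a": None, "L1-a": "L2-a", "L1-b": "L2-a"})
chain = path_to_root(conn, "L1-b")
```

A highest-level cluster is stored with a NULL parent, which ends the upward walk.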

8. Method according to claim 6, characterized in that the database is an object oriented data
base, wherein the keys are defined by cluster names and the values are defined by the 
name of the parent directory. 
9. Method according to one of claims 1 to 8, characterized in that a group of data sets comprising
one or more predetermined data elements is selected and said limiting values are
determined on the basis of said selected group of data sets. 

10. Method according to one of claims 1 to 9, characterized in that the total range of distance 
values is completely partitioned into a sequence of distance intervals and said density of 
distance values is determined as the number or normalized number of distance values in 
each distance interval. 
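
A minimal sketch of this density determination, assuming equal-width intervals that together cover the full range of distance values, is given below. The function name `distance_density` and the parameters `n_intervals` and `normalize` are hypothetical.

```python
def distance_density(distances, n_intervals, normalize=False):
    """Count distance values per interval over the full range of distances."""
    lo, hi = min(distances), max(distances)
    # Fall back to a unit width when all distances are equal.
    width = (hi - lo) / n_intervals or 1.0
    counts = [0] * n_intervals
    for d in distances:
        # The largest distance is placed in the last interval.
        k = min(int((d - lo) / width), n_intervals - 1)
        counts[k] += 1
    if normalize:
        total = len(distances)
        return [c / total for c in counts]
    return counts


density = distance_density([0.0, 1.0, 1.5, 2.0, 8.0, 9.0, 10.0], 5)
```

With `normalize=True` the same call returns the normalized number of distance values in each interval.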

11. Method according to claim 10, characterized in that a plurality of partitionings of said 
total range of distances with increasing interval size is established and said density is es- 
tablished for each of said partitionings, and that preliminary limiting values are deter- 
mined for each partitioning and optimized limiting values are obtained by averaging or 
fitting said preliminary limiting values, wherein said correlation data are established on 
the basis of said optimized limiting values. 

12. Method according to claim 10 or 11, characterized in that a distribution of distance den- 
sity values is established from said partitioning and said limiting values are determined 
from said distribution. 

13. Method according to one of claims 1 to 12, characterized in that one or more limiting 
values are determined as a minimum or zero point of the density adjacent to a maximum 
of said density. 
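
One possible way to locate such limiting values, sketched here for the case D-D0 ≥ 0, is to walk along the density from each maximum down to the adjacent minimum or first zero. The sketch assumes the density is given as a list of non-negative values with one representative distance (for example the interval centre) per entry; the name `limits_after_peaks` is hypothetical.

```python
def limits_after_peaks(density, centres):
    """Return the distance at the minimum or zero that follows each maximum."""
    limits = []
    k, n = 0, len(density)
    while k < n - 1:
        # Climb to the next local maximum (plateaus are passed over).
        while k < n - 1 and density[k + 1] >= density[k]:
            k += 1
        peak = k
        # Descend until the density stops falling or reaches zero.
        while k < n - 1 and density[k + 1] < density[k]:
            k += 1
            if density[k] <= 0:
                break
        if k > peak:
            limits.append(centres[k])
    return limits


found = limits_after_peaks([3, 1, 0, 0, 3], [1.0, 3.0, 5.0, 7.0, 9.0])
```

Because the walk proceeds from small to large distances, the returned limiting values form an increasing series; a maximum at the very end of the range yields no limiting value in this sketch.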

14. Method according to one of claims 1 to 13, characterized in that a curve is fitted to density values
and one or more limiting values are determined as the point of a minimum or zero adjacent
to a maximum of said curve.

15. Method according to claim 14, characterized in that said curve is fitted to a distribution of
density values. 

16. Method according to one of claims 14 or 15, characterized in that said curve is a poly- 
nomial or a trigonometric function or a function of trigonometric functions. 
17. Method according to one of claims 1 to 16, characterized in that said data sets comprise
text data and said distance is a function of the number of common words of two data sets. 
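
The claim leaves the exact function open. As one hedged example, the Jaccard distance over word sets is a function of the number of common words with D0 = 0 for identical data sets and D-D0 ≥ 0 for all pairs. The choice of the Jaccard distance, the whitespace tokenisation, and the name `word_distance` are assumptions of this sketch.

```python
def word_distance(text_a, text_b):
    """Jaccard distance between the word sets of two texts (0.0 if identical)."""
    words_a = set(text_a.lower().split())
    words_b = set(text_b.lower().split())
    union = words_a | words_b
    if not union:
        return 0.0  # two empty texts are treated as identical
    common = words_a & words_b
    return 1.0 - len(common) / len(union)


d_same = word_distance("the cat sat", "the cat sat")
d_half = word_distance("the cat sat", "the dog sat")
```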

18. Method according to one of claims 1 to 16, characterized in that said data sets comprise 
genetic information and said distance is a function of the number of identical data ele- 
ments succeeding one another in two partial sequences in said data sets. 
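
As an illustrative sketch only, the number of identical data elements succeeding one another can be taken as the length of the longest run common to both sequences and turned into a distance with D0 = 0. The normalisation by the longer sequence length and the names `longest_common_run` and `sequence_distance` are assumptions, not part of the source.

```python
def longest_common_run(seq_a, seq_b):
    """Length of the longest run of identical consecutive elements in both."""
    best = 0
    prev = [0] * (len(seq_b) + 1)
    for a in seq_a:
        curr = [0] * (len(seq_b) + 1)
        for j, b in enumerate(seq_b, start=1):
            if a == b:
                # Extend the run that ended at the previous element of each.
                curr[j] = prev[j - 1] + 1
                best = max(best, curr[j])
        prev = curr
    return best


def sequence_distance(seq_a, seq_b):
    """Distance in [0, 1]; 0.0 for identical sequences."""
    longest = max(len(seq_a), len(seq_b))
    if longest == 0:
        return 0.0
    return 1.0 - longest_common_run(seq_a, seq_b) / longest


run = longest_common_run("GATTACA", "TTACGG")
```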

19. Method according to one of claims 1 to 18, characterized in that the step of creating cor- 
relation data comprises 

establishing a distance matrix for all data sets, 

assigning data sets to a first level cluster that are linked by matrix elements having a
value less than the lowest limiting value for D ≥ D0 or greater than the highest limiting
value for D ≤ D0.

20. Method according to one of claims 1 to 19, characterized in that the data sets are dis- 
played graphically as vertices connected to every other vertex by edges, the length of 
each edge corresponding to the distance between two data sets, that edges having a length 
less than that corresponding to the lowest limiting value are removed and data sets represented
by a connected remaining subgraph are assigned to the same cluster at the lowest
level.

21. Database obtainable by a method of one of claims 1 to 20. 

22. Computer program adapted to perform all steps of a method according to claim 1 or any
claim dependent thereon. 

23. Computer program according to claim 22 embodied in a computer readable medium.

24. Apparatus for automatically ordering a plurality of sets of electronic data according to 
their similarities, comprising data processing means performing the steps of a method ac- 
cording to one of claims 1 to 20. 
25. Apparatus according to claim 24, characterized in that it comprises a display device and
said data processing means controls said display device according to a method according 
to claim 4. 

26. Apparatus according to one of claims 24 or 25, characterized in that it comprises data 
storage means for storing said data sets in a directory structure, said directory structure 
being obtainable according to a method according to claim 5. 

27. Method of operating an apparatus for searching and/or ordering data sets, said apparatus 
containing or being capable of obtaining correlation data obtainable according to one of 
claims 1 to 20, characterized by the following steps: 

inputting data elements, 

selecting data sets comprising these data elements, 

selecting a cluster at the lowest level in a hierarchy of the selected data sets 
defined by said correlation data, 
whereupon said apparatus outputs data related to the elements of said selected cluster. 

28. Method according to claim 27, characterized in that the apparatus, having outputted data
related to the elements of said selected cluster, outputs data related to the elements of the
next higher order cluster comprising said selected cluster and not contained in said selected
cluster.

29. Method according to one of claims 27 or 28, characterized in that said apparatus proceeds 
with outputting data related to elements of at least one higher order cluster only upon a 
related input by a user. 
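
The search procedure of claims 27 to 29 can be sketched, under several assumptions, as a generator that outputs one group of data sets per request. The sketch assumes that each data set is a set of data elements, that the hierarchy is given as a list of partitions with the lowest level first, and that the selected cluster is the one containing the first matching data set; the name `browse` and all parameter names are hypothetical.

```python
def browse(levels, data_sets, query):
    """Yield groups of data-set indices, starting with the selected cluster."""
    # Select the data sets that contain all inputted data elements.
    hits = [i for i, elements in enumerate(data_sets) if query <= elements]
    if not hits:
        return
    start = hits[0]  # assumption: the first match determines the cluster
    seen = set()
    for partition in levels:
        cluster = next(c for c in partition if start in c)
        # Output only members not already contained in a lower-level cluster.
        fresh = [i for i in cluster if i not in seen]
        if fresh:
            yield fresh
        seen.update(fresh)


data_sets = [{"a", "b"}, {"a", "c"}, {"a", "d"}, {"x"}]
levels = [[[0, 1], [2], [3]], [[0, 1, 2], [3]]]
steps = browse(levels, data_sets, {"a"})
first = next(steps)
second = next(steps)
```

Each call to `next` stands in for a related input by the user, so members of a higher order cluster are produced only when requested.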



